Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Authors

  • Qinbao Song
  • Martin J. Shepperd
  • Xiangru Chen
  • Jun Liu
Abstract

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated 3 missingness mechanisms, 3 missing data patterns, and 5 missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
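
As a rough illustration of the k-NN imputation idea mentioned in the abstract, the sketch below fills each missing value with the mean of that feature over the k most similar rows. It is not the authors' implementation: the function name `knn_impute`, the use of `None` as the missing marker, the restriction to numeric features, and the rescaled Euclidean distance over shared features are all assumptions made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Hypothetical helper for illustration; assumes numeric features only.
    """

    def distance(a, b):
        # Euclidean distance over features observed in both rows, rescaled
        # by the share of usable features; infinite if nothing is shared.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * len(a) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows that have feature j observed and share
            # at least one observed feature with the incomplete row.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:k]
            if nearest:  # leave the gap if no usable donor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (made up): columns are two numeric project attributes.
projects = [[1.0, 10.0], [2.0, 20.0], [1.5, None], [9.0, 90.0]]
filled = knn_impute(projects, k=2)
```

In this toy data the incomplete third row is closest to the first two rows, so its missing value is filled with the mean of their second attribute. A classifier such as C4.5 would then be trained on the completed table rather than on the table with gaps.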

Similar resources

A new imputation method for small software project data sets

Effort prediction is a very important issue for software project management. Historical project data sets are frequently used to support such prediction. But missing data are often contained in these data sets and this makes prediction more difficult. One common practice is to ignore the cases with missing data, but this makes the originally small software project database even smaller and can ...

A Comparison of Accuracy between Decision Tree and k-NN Algorithm

Data mining has many functionalities. One of the main functions of data mining is the classification that is used to predict the class and generate information based on historical data. In the classification, there is a lot of algorithms that can be used to process the input into the desired output, thus it is very important to observe the performance of each algorithm. The purpose of this rese...

Empirical Evaluation of Missing Data Techniques for Effort Estimation

Multivariate regression models have been commonly used to estimate the software development effort to assist project planning and/or management. These models require a complete data set that has no missing values for model construction. The complete data set is usually built either by using imputation methods or by deleting projects and/or metrics that have missing values (we call this RC delet...

Benchmarking k-Nearest Neighbour Imputation with Homogeneous Likert Data (P. Jönsson and C. Wohlin, Empirical Software Engineering)

Missing data are common in surveys regardless of research field, undermining statistical analyses and biasing results. One solution is to use an imputation method, which recovers missing data by estimating replacement values. Previously, we have evaluated the hot-deck k-Nearest Neighbour (kNN) method with Likert data in a software engineering context. In this paper, we extend the evaluation by ...

A Short Note on Using Multiple Imputation Techniques for Very Small Data Sets

This short note describes a simple experiment to investigate the value of using multiple imputation (MI) methods [2, 3]. We are particularly interested in whether a simple bootstrap based on a k-nearest neighbour (kNN) method can help address the problem of missing values in two very small, but typical, software project data sets. This is an important question because, unfortunately, many real-...

Journal:
  • Journal of Systems and Software

Volume 81

Publication date: 2008